ABSTRACT


With the prospect of highly parallel shared-memory
MIMD architectures for solving very large problems, novel challenges
arise for the system software designer. The operating system
must endeavor to utilize all processors fully, without incurring serial
bottlenecks during coordination operations. Critical code sections far
too short or infrequent to cause performance penalties on today's
machines will be of concern on very large machines because the cost
of each serial section rises linearly with the number of processors
involved. Further, the control software must provide basic facilities
to support a structured and natural style of general-purpose parallel
programming. We present the approaches taken for satisfying these
requirements in the NYU Ultracomputer design. We also describe
our current preliminary parallel operating system, derived from [...],
which runs on an eight-processor prototype Ultracomputer.

1.   Introduction 

Developments in microelectronics will produce, during the current
decade, components that make it possible to assemble an extraordinarily powerful
machine composed of thousands of processors and gigabytes of memory.
Such a configuration would yield several orders of magnitude more
performance than do current supercomputers from roughly the same
component  count.  Moreover,  the  number  of  distinct  components  would  be 
quite  small  and  thus  the  design  appears  attractive. 

However,  it  remains  to  be  demonstrated  that  this  potentially  high- 
performance  assemblage  can  be  effectively  utilized.  There  are  two 
dimensions  to  this  challenge.  First,  several  thousand  processors  must  be 
coordinated  in  such  a  way  that  their  aggregate  power  is  applied  to  useful 
computation.  Serial  procedures  in  which  one  processor  works  while  the 
others  wait  become  bottlenecks  that  drastically  reduce  the  power  obtained. 
Indeed,  for  any  highly  parallel  architecture,  the  relative  cost  of  a  serial 
bottleneck  rises  linearly  with  the  number  of  processors  involved.  Second,  the 
machine  must  be  programmable  by  humans.  Effective  use  of  high  degrees  of 
parallelism  will  be  facilitated  by  simple  languages  and  facilities  for  designing, 
writing,  and  debugging  parallel  programs. 

We  propose  that  the  hardware  and  software  design  of  a  highly  parallel 
computer  should  meet  the  following  goals. 

•  Scaling.  Effective  performance  should  scale  upward  to  a  very  high  level. 
Given  a  problem  of  sufficient  size,  an  n-fold  increase  in  the  number  of 
processors  should  yield  a  speedup  factor  of  almost  n. 

•  General  purpose.  The  machine  should  be  capable  of  efficient  execution 
of  a  wide  class  of  algorithms,  displaying  relative  neutrality  with  respect 
to  algorithmic  structure  or  data  flow  pattern. 

• Programmability. High-level programmers should not have to consider
the  machine's  low-level  structural  details  in  order  to  write  efficient 
programs.  Programming  and  debugging  should  not  be  substantially 
more  difficult  than  on  a  serial  machine. 

•  Multiprogramming.  The  software  should  be  able  to  allocate  processors 
and  other  machine  resources  to  different  phases  of  one  job  and/or  to 
different  user  jobs  in  an  efficient  and  highly  dynamic  way. 

Achievement  of  these  goals  requires  an  integrated  hardware/software 
approach.  The  burden  on  the  operating  system  designer  is  to  support  a 
programming  model  that  is  high-level  and  flexible,  to  schedule  the  processors 
so  that  the  workload  is  balanced,  and,  most  importantly,  to  avoid  critical 
sections  that  would  constitute  unacceptable  serial  bottlenecks. 

We  will  consider  these  and  subsidiary  issues  in  the  context  of  a  particular 
shared-memory MIMD¹ parallel architecture, the NYU Ultracomputer. This
design,  which  will  be  briefly  outlined  in  the  following  section,  was  driven  by 
the  above  objectives  plus  the  desire  that  the  machine  be  buildable  without 
requiring  unanticipated  advances  in  hardware  technology  or  packaging. 


¹In the taxonomy of Flynn [66], MIMD is the category of Multiple Instruction stream,
Multiple Data stream computers.

Following the architectural description we will consider issues in parallel
programming  that  impact  the  operating  system,  in  particular  the  system 
services  utilized  by  parallel  code,  and  questions  of  process  and  job 
scheduling.  The  final  section  will  describe  the  data  structures  and  the 
resource  management  and  synchronization  mechanisms  that  have  been 
developed  for  an  ultraparallel  operating  system.  Many  of  these  building 
blocks  have  been  implemented  in  an  operating  system  currently  in  use  on  an 
eight-processor  prototype  Ultracomputer. 

2.   Machine  Architecture 

In this section we review the architectural model on which the
Ultracomputer  is  based,  and  illustrate  the  power  of  this  model.  Although 
this  idealized  machine  is  not  physically  realizable,  a  close  approximation  can 
be  built.  Elements  of  the  actual  machine  design  are  described  in  order  to 
illustrate  integrated  hardware/software  mechanisms  for  bottleneck-free 
coordination  of  a  very  large  number  of  processors. 

2.1.  Paracomputers 

An  idealized  parallel  processor,  dubbed  a  "paracomputer"  by  Schwartz 
[80]  and  classified  as  a  WRAM  by  Borodin  and  Hopcroft  [82],  consists  of  a 
number  of  autonomous  processing  elements  (PEs)  sharing  a  central  memory. 
The  model  permits  every  PE  to  read  or  write  a  shared  memory  cell  in  one 
cycle. In particular, simultaneous reads and writes directed at the same
memory  cell  are  effected  in  a  single  cycle. 

In  order  to  make  precise  the  effect  of  simultaneous  access  to  shared 
memory  we  define  the  serialization  principle,  which  states  that  the  effect  of 
simultaneous  actions  by  the  PEs  is  as  if  the  actions  had  occurred  in  some 
(unspecified)  serial  order.  Thus,  for  example,  a  load  simultaneous  with  two 
stores  directed  at  the  same  memory  cell  will  return  either  the  original  value 
or  one  of  the  two  stored  values,  possibly  different  from  the  value  which  the 
cell  finally  comes  to  contain.  Note  that  simultaneous  memory  updates  are  in 
fact  accomplished  in  one  cycle;  the  serialization  principle  speaks  only  of  the 
effect  of  simultaneous  actions  and  not  of  their  implementation. 

2.2.  The  Fetch-and-Add  Operation 

We  augment  the  paracomputer  model  with  the  "fetch-and-add"  (F&A) 
operation, a powerful interprocessor synchronization operation that permits
highly concurrent execution of operating system primitives and application
programs (see Gottlieb and Kruskal [81]). Fetch-and-add is essentially an
indivisible add to memory; its format is F&A(V,e), where V is an integer
variable and e is an integer expression. The operation is defined to return
the (old) value of V and to replace V by the sum V+e. Moreover,
concurrent fetch-and-adds are required to satisfy the serialization principle
enunciated above. Thus fetch-and-add operations simultaneously directed at
V  would  cause  V  to  be  modified  by  the  appropriate  total  increment  while 
each  operation  yields  the  intermediate  value  of  V  corresponding  to  its 
position  in  this  order.  The  following  example  illustrates  the  semantics  of 
fetch-and-add:  Consider  several  PEs  concurrently  executing  F&A(I,1),  where 
I  is  a  shared  variable  used  to  index  into  a  shared  array.  Each  PE  obtains  an 
index  to  a  distinct  array  element  (although  one  cannot  predict  which  element 
will  be  assigned  to  which  PE),  and  I  receives  the  appropriate  total  increment. 
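The index-allocation example can be sketched in software. The following is a minimal illustration only: the names SharedInt and claim_indices are hypothetical, threads stand in for PEs, and a lock is used merely to emulate indivisibility. The lock serializes callers, which is exactly what the hardware described later is designed to avoid.

```python
import threading


class SharedInt:
    """Shared integer with a software-emulated fetch-and-add.

    Assumption: the lock only emulates the indivisibility of F&A; real
    combining hardware would not serialize the callers.
    """

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        """Return the old value of V and replace V by V + e."""
        with self._lock:
            old = self.value
            self.value = old + e
            return old


def claim_indices(num_pes):
    """Each simulated PE executes F&A(I, 1) once and records its index."""
    index = SharedInt(0)
    claimed = [None] * num_pes

    def pe(pid):
        claimed[pid] = index.fetch_and_add(1)

    threads = [threading.Thread(target=pe, args=(p,)) for p in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed, index.value
```

Which PE receives which index varies from run to run, but the set of indices handed out is always 0 through num_pes-1 and the shared variable ends at num_pes.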

2.3. The General Fetch-and-φ Operation

Fetch-and-add is a special case of the more general fetch-and-φ operation
(where φ may be an arbitrary binary associative operator) introduced by
Gottlieb and Kruskal. The classic test-and-set and compare-and-swap
synchronization operations are both special cases of fetch-and-φ as well, and
the familiar load and store operations are degenerate cases. For example,
Test&Set(S) is equivalent to Fetch&OR(S,TRUE), while Load(S) may be
accomplished with Fetch&π₁(S,*) where π₁(x,y) = x and the value of * is
immaterial (and thus need not be transmitted).

Although  experience  to  date  shows  fetch-and-add  to  be  the  most 
important member of this class, other fetch-and-φ operations will be useful as
well  in  the  construction  of  ultraparallel  operating  systems.  For  example, 
fetch-and-or  with  fetch-and-and  provide  a  means  for  atomically  setting  and 
clearing  bit  flags  within  a  word  of  storage.  On  a  standard  architecture 
without  these  operations,  bit  setting  and  clearing  functions  require  critical 
sections  protected  by  semaphores. 

2.4.  The  Power  of  Fetch-and-Add 

Using  the  fetch-and-add  operation  we  can  perform  many  important 
algorithms  in  a  completely  parallel  manner,  i.e.  without  using  any  critical 
sections.  For  example,  as  indicated  above,  concurrent  executions  of 
F&A(I,1)  yield  consecutive  values  that  may  be  used  to  index  an  array.  If  this 
array  is  interpreted  as  a  (sequentially  stored)  queue,  the  values  returned  may 
be  used  to  perform  concurrent  inserts;  analogously  F&A(D,1)  may  be  used 
for  concurrent  deletes.  The  complete  queue  algorithms  contain  checks  for 
overflow  and  underflow,  collisions  between  insert  and  delete  pointers,  etc. 
(see  Gottlieb,  Lubachevsky,  and  Rudolph  [83]).  Forthcoming  sections  will 
indicate  how  such  techniques  can  be  used  to  implement  a  totally  decentralized 
operating  system  scheduler.  We  are  unaware  of  any  other  completely  parallel 
solutions  to  this  problem.  To  illustrate  the  nonserial  behavior  obtained,  we 
note  that  given  a  single  queue  that  is  neither  empty  nor  full,  the  concurrent 
execution  of  thousands  of  inserts  and  thousands  of  deletes  can  all  be 
accomplished  in  the  time  required  for  just  one  such  operation.  The 
importance of critical-section-free queue management routines may be seen in
the following remark of Deo et al. [80].

However,  regardless  of  the  number  of  processors  used,  we  expect  that 
algorithm  PPDM  has  a  constant  upper  bound  on  its  speedup,  because  every 
processor  demands  private  use  of  the  Q. 
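The use of F&A(I,1) and F&A(D,1) as insert and delete pointers can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of Gottlieb, Lubachevsky, and Rudolph [83]: it omits the overflow, underflow, and pointer-collision checks mentioned above, and it assumes that deletes start only after the corresponding inserts have stored their items. ArrayQueue, SharedInt, and run_concurrently are hypothetical names, and the lock merely emulates an indivisible fetch-and-add.

```python
import threading


class SharedInt:
    """Integer with a lock-emulated fetch-and-add (software stand-in only)."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self.value
            self.value = old + e
            return old


class ArrayQueue:
    """Sequentially stored queue whose pointers are advanced with F&A."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.insert_ptr = SharedInt(0)   # the variable I in the text
        self.delete_ptr = SharedInt(0)   # the variable D in the text

    def insert(self, item):
        # Concurrent callers obtain distinct slots; no overflow check here.
        self.slots[self.insert_ptr.fetch_and_add(1) % self.size] = item

    def delete(self):
        # Concurrent callers obtain distinct slots; no underflow check here.
        return self.slots[self.delete_ptr.fetch_and_add(1) % self.size]


def run_concurrently(fn, args_list):
    """Run fn once per argument tuple, each call in its own thread."""
    threads = [threading.Thread(target=fn, args=a) for a in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Apart from the emulated fetch-and-add itself, no step of insert or delete excludes other callers, which is the property the text emphasizes.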

For  another  example  of  the  power  of  fetch-and-add,  consider  the 
classical  readers-writers  problem  in  which  two  classes  of  processes  are  to 
share  access  to  a  central  data  structure.  One  class,  the  readers,  may  execute 
concurrently;  whereas  the  writers  demand  exclusive  access.  Although  there 
are  many  solutions  to  this  problem,  only  the  fetch-and-add  based  solution 
given  by  Gottlieb,  Lubachevsky,  and  Rudolph  has  the  crucial  property  that 
during periods of no writer activity, no critical sections are executed². Other
highly  parallel  fetch-and-add-based  algorithms  appear  in  Kalos  [81],  Kruskal 
[81a],  and  Rudolph  [82]. 

2.5.   Hardware  Realization 

The  paracomputer  is  not  physically  realizable,  due  to  fan-in  (and  other) 
limitations.  Furthermore,  memory  modules  operate  sequentially;  only  one 
load or store may be satisfied in one cycle. If concurrent fetch-and-add or
load  operations  were  to  be  serialized  at  the  memory  of  a  real  parallel 
computer,  then  we  would  lose  the  advantage  of  parallel  coordination 
algorithms,  having  merely  moved  the  critical  sections  from  the  software  to 
the  hardware. 

In fact, a parallel processor closely approximating our idealized
paracomputer can be built. In this section we sketch the design of the NYU
Ultracomputer, and refer the reader to Gottlieb, Grishman, et al. [83] for a
more detailed description. The Ultracomputer uses a message switching
network with the topology of Lawrie's [75] Ω-network (equivalently, the SW
Banyan of Goke and Lipovsky [73]) to connect N = 2^D autonomous PEs to a
central shared memory composed of N memory modules (MMs). Note that
the direct single cycle access to shared memory characteristic of
paracomputers is approximated by an indirect access via a multicycle
connection network.

2.5.1. Ω-network Enhancements The manner in which an Ω-network
can be used to implement memory loads and stores is well known and is
based on the existence of a (unique) path connecting each PE-MM pair. We
first enhance the basic Ω-network design by associating queues with each
switch to enable concurrent processing of requests for the same port³.


²Most other solutions require readers to execute small critical sections to check if a
writer is active and indivisibly announce their own presence. The "eventcount" mechanism
of Reed and Kanodia [79], although completely parallel in the above sense, detects rather
than prevents the simultaneous activity of readers and writers.

When  concurrent  loads  and  stores  are  directed  at  the  same  memory 
location  and  meet  at  a  switch,  they  can  be  combined  without  introducing  any 
delay  (see  Klappholz  [80],  and  Gottlieb,  Lubachevsky,  and  Rudolph  [83]). 
Combining  requests  reduces  communication  traffic  and  thus  decreases  the 
lengths  of  the  queues  within  the  switches,  leading  to  lower  network  latency 
(i.e.  reduced  memory  access  time).  Since  combined  requests  can  themselves 
be  combined,  the  network  satisfies  the  key  property  that  any  number  of 
concurrent  memory  references  to  the  same  location  can  be  satisfied  in  the 
time  required  for  one  central  memory  access.  It  is  this  property,  when 
extended  to  include  fetch-and-add  operations,  that  permits  the  bottleneck-free 
implementation  of  many  coordination  protocols. 

2.5.2.  Implementing  Fetch-and-add  By  including  adders  in  the 
MMs,  the  fetch-and-add  operation  can  be  easily  implemented:  When 
F&A(X,e)  reaches  the  MM  containing  X,  the  value  of  X  and  the  transmitted 
e  are  brought  to  the  MM  adder,  the  sum  is  stored  in  X,  and  the  old  value  of 
X  is  returned  through  the  network  to  the  requesting  PE.  Since  fetch-and-add 
is  our  sole  synchronization  primitive  and  is  also  a  key  ingredient  in  many 
algorithms,  concurrent  fetch-and-add  operations  will  often  be  directed  at  the 
same  location.  Thus,  as  indicated  above,  it  is  crucial  that  a  design  supporting 
large  numbers  of  processors  not  serialize  this  activity. 

Enhanced  switches  permit  the  network  to  combine  fetch-and-adds  with 
the  same  efficiency  as  it  combines  loads  and  stores.  When  two  fetch-and- 
adds  referencing  the  same  shared  variable,  say  F&A(X,e)  and  F&A(X,f), 
meet at a switch, the switch forms the sum e+f, transmits the combined
request F&A(X,e+f), and stores the value e in its local memory. When the
value Y is returned to the switch in response to F&A(X,e+f), the switch
transmits Y to satisfy the original request F&A(X,e) and transmits Y+e to
satisfy the original request F&A(X,f). Assuming that the combined request
was not further combined with yet another request, we would have Y = X;
thus the values returned by the switch are X and X+e, thereby effecting the
serialization order "F&A(X,e) followed immediately by F&A(X,f)". The
memory location X is also properly incremented, becoming X+e+f. If other
fetch-and-add  operations  updating  X  are  encountered,  the  combined  requests 
are  themselves  combined,  and  the  associativity  of  addition  guarantees  that  the 
procedure  gives  a  result  consistent  with  the  serialization  principle. 
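The combining rule can be checked with a small sequential simulation. The sketch below is an illustration under simplifying assumptions: a dictionary stands in for a memory module, and a recursive split models requests that are combined pairwise at successive switches. All function names are hypothetical.

```python
def memory_fetch_and_add(memory, addr, inc):
    """The MM adder: store the sum, return the old value."""
    old = memory[addr]
    memory[addr] = old + inc
    return old


def combine(e, f):
    """A switch merges F&A(X,e) and F&A(X,f).

    Returns the increment forwarded toward memory (e+f) and the value
    the switch saves locally (e).
    """
    return e + f, e


def decombine(y, saved_e):
    """On the return path, Y answers F&A(X,e) and Y+e answers F&A(X,f)."""
    return y, y + saved_e


def combined_fetch_and_add(memory, addr, increments):
    """Serve several simultaneous F&A requests with one memory access."""
    y = memory_fetch_and_add(memory, addr, sum(increments))

    def distribute(base, incs):
        # Each split models one switch that saved the left-hand sum.
        if len(incs) == 1:
            return [base]
        mid = len(incs) // 2
        left, right = incs[:mid], incs[mid:]
        return distribute(base, left) + distribute(base + sum(left), right)

    return distribute(y, increments)
```

For two requests with e = 3 and f = 4 against X = 10, the switch forwards an increment of 7, memory returns 10, and the requesters receive 10 and 13 while X becomes 17. With more requests the returned values are the prefix sums of the increments, as some serial order would produce.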

Analogous hardware enables the machine to support effectively fetch-and-φ
for other associative binary operators φ.


³The alternative adopted by Burroughs [79] of killing one of the two conflicting requests
limits bandwidth to O(N/log N); see Kruskal and Snir [83].

2.5.3. Processor cache The impact of network latency on performance
is  reduced  by  associating  a  local  cache  memory  with  each  PE.  Frequently- 
used  program  code  and  data  can  be  accessed  in  (approximately)  a  single 
processor  cycle  when  resident  in  the  cache.  Thus  the  network  latency  is 
eliminated  from  many  memory  accesses,  and  all  accesses  benefit  from  the 
reduced  network  traffic. 

Unfortunately, caching of read-write shared variables presents a
coherence problem among the various caches. Different caches will in
general come to contain different values for the same variable, and updates
of the corresponding memory cell will occur out of sequence. Means are
needed to ensure synchronized, coherent access among the multiple PEs for
variables used for coordination or data transmission. An obvious
mechanism, which is employed in the prototype Ultracomputer, is merely to
prevent caching of read-write shared variables.⁴ The software is required to
distinguish  between  shared  and  private  variables,  typically  by  grouping  them 
into  separate  memory-management  segments,  and  communicate 
"cacheability"  information  to  the  hardware  on  each  memory  access. 

Because  the  cache  hit  ratio  is  a  major  factor  in  machine  performance, 
maximizing  cacheability  is  an  important  function  of  the  compiler  and 
operating  system  software.  This  suggests  supporting  a  more  elaborate 
scheme  in  which  shared  variables  that  are  accessed  read-only,  or  accessed 
only  privately  during  a  particular  code  segment,  may  be  cacheable  during 
execution  of  that  segment  (see  McAuliffe  [85]). 

3. Parallel Programming

3.1.   Levels  of  Parallel  Control 

We  consider  applications  programming  of  Ultracomputer-like  parallel 
computers  primarily  from  the  standpoint  of  operating  system  design.  The 
shared-memory  MIMD  model  can  support  a  number  of  different  styles  of 
programming.  Relative  efficiency  and  usefulness  of  these  alternatives  are 
affected  by  performance  issues  relating  to  the  underlying  software 
implementation.  Automatic  parallelization  of  sequential  programs  is  not 
addressed;  we  assume  here  that  programs  are  designed  originally  for  a 
shared-memory MIMD computer.⁵


⁴Because of the stochastic nature of memory access through the Ω-network, this may not
be sufficient to ensure that the synchronization is maintained. If the processor or cache is
capable of issuing memory requests and proceeding without waiting for acknowledgment
from the network, then for code sensitive to synchronization a further mechanism (a
"FENCE" operation) is needed to guarantee that updates are sequenced correctly (Collier
[81]).

⁵However, the automatic techniques of Kuck (see Kuck and Padua [79]) and Kennedy
(see Kennedy [80], Allen and Kennedy [84]) will be important for running both existing and

Much work on concurrent programming has focused on a high-level,
block-structured  paradigm  of  parallel  process  control.  Under  this  model 
parallel  code  is  structured  in  closed-form  constructs  with  implicit 
synchronization  at  the  end  of  each  parallel  block;  explicit  synchronization  and 
"fork/join"  operations  (Conway  [63],  Dennis  and  Van  Horn  [66])  are 
discouraged  or  disallowed.  The  argument  is  that  the  resulting  parallel  code  is 
simpler,  clearer,  and  easier  to  debug  than  code  with  unrestrained  and 
unstructured  process  creation/destruction  and  synchronization.  Furthermore, 
this  programming  style  obviates  in  many  cases  the  need  for  explicit 
shared/private  declarations.  That  distinction  is  instead  implicit  in  the  usual 
static  scope  rules.  Thus,  for  example,  a  variable  that  is  visible  within  a  block 
defining  an  "iteration"  of  a  parallel  loop,  but  declared  in  a  larger  enclosing 
scope,  is  taken  to  be  shared  during  execution  of  the  parallel  loop;  a  variable 
declared  within  the  block  itself  has  scope  of  an  individual  iteration  and  hence 
is  private.  Finally,  automatic  optimization  of  parallel  code  is  facilitated  when 
the  parallel  structure  of  a  program  can  be  readily  detected  by  the  compiler. 
It  should  be  possible,  for  example,  for  an  optimizing  compiler  to  implement  a 
fine  granularity  of  control  over  the  cacheability  of  data  areas,  with  greater 
reliability  (and  thus  avoidance  of  cache  coherence  errors)  than  could  be 
specified  by  the  programmer. 

However,  the  utility  and  effectiveness  of  this  high-level  parallel 
programming  style  is  not  yet  demonstrated.  Furthermore,  the  volatile 
parallelism  facilitated  by  these  closed-form  parallel  constructs  will  not  be 
needed  in  all  programs.  When  parallel  processes  need  to  synchronize  very 
frequently,  the  overhead  involved  in  the  process  creation  and  termination 
operations  invoked  by  these  constructs  will  become  significant.  This 
overhead can be alleviated to some extent by pre-spawning processes (see
section  3.5).  However,  many  parallel  applications  will  be  more  suited  to  a 
lower-level  programming  style.  One  such  approach  involves  the  initial 
creation  of  a  fixed  number  of  long-lived  processes,  usually  smaller  than  the 
number  of  PEs.  Synchronization,  scheduling,  and  memory  management 
become  entirely  the  responsibility  of  the  application  programmer.  Here  the 
syntactic  structure  of  the  program  provides  no  information  regarding  the 
dynamic  parallelism  or  the  sharing  of  data.  Ramifications  of  this 
programming  model  will  be  discussed  in  section  4.2. 

3.2.   Parallel  Constructs 

While  programming  language  issues  per  se  are  not  germane  to  this 
paper,  for  concreteness  we  present  examples  of  two  commonly  proposed 
high-level  parallel  constructs.  Note  that  programming  in  the  MIMD  parallel 
environment  need  not  be  radically  different  from  conventional  sequential 
programming. We consider parallel languages which are variants of
conventional procedural languages, augmented only with a shared/private
attribute  for  declared  variables  and  a  small  number  of  explicit  parallel  control 
constructs. 

3.2.1.  Control  Structures  The  parallel  loop,  in  which  homogeneous 
iterates  are  executed  in  parallel  instead  of  serially,  is  an  obvious  parallel 
extension of the loop construct found in every procedural language (see
Gosden  [66],  Droughon  et  al.  [67],  Lundstrom  and  Barnes  [80],  and  Davies 
[81]).  It  may  be  stated  as: 

forall  j  from  lb  to  ub  step  inc  do  s 

As is the case for a sequential loop, the number of iterations, I, performed
over the loop body, s, is controlled by the variables lb, ub, and inc.
However, the I iterations of the body are executed in parallel with each
evaluation  using  a  different  value  for  j.  Each  iteration  is  represented  by  a 
different  process;  the  operating  system  arranges  that  each  process  is 
scheduled  on  a  PE  for  execution.  The  loop  construct  provides  implicit 
synchronization;  that  is,  the  statements  following  the  forall  block  will  be 
executed  (serially)  only  after  all  of  the  parallel  iterations  have  completed 
executing  the  loop  body  s. 
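The behavior of the parallel loop, including its implicit synchronization, can be approximated with a standard thread pool. This is only a sketch: it assumes that ub is inclusive and inc is positive, the function name forall and the max_workers parameter are choices made here, and pool threads stand in for PEs rather than for one process per iteration.

```python
from concurrent.futures import ThreadPoolExecutor


def forall(lb, ub, inc, body, max_workers=8):
    """Run body(j) for j = lb, lb+inc, ..., up to ub, in parallel.

    Assumptions: ub is inclusive and inc > 0. Returning from this
    function corresponds to the implicit synchronization at the end
    of the forall block.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every iteration and re-raises any exception.
        list(pool.map(body, range(lb, ub + 1, inc)))
```

Statements placed after a call to forall run serially and only once every iteration has completed, mirroring the semantics described above.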

The parallel compound statement, or parallel block, contains
inhomogeneous statements that are to be executed in parallel. It is also a
popular structure and has appeared many times (e.g., Dijkstra [68], Brinch-Hansen
[73], and the collateral expression of ALGOL 68) and may be
expressed as follows:

s_0
parbegin
    s_1; s_2; ...; s_n
parend
s_{n+1}

First, the statement s_0 is executed; then statements s_1 through s_n are executed
in parallel; finally statement s_{n+1} is executed.

Neither  construct  contains  explicit  reference  to  processors.  The  PEs 
themselves  are  a  resource  that  is  only  indirectly  available  to  the  program. 
Parallel  constructs  generate  tasks  (also  called  processes)  which  run  on 
processors. Since the degree of parallelism (number of concurrently active
tasks)  may  be  highly  volatile,  static  assignment  of  processors  to  programs  is 
not  in  general  desirable  (exceptions  will  be  discussed  shortly).  Rather,  tasks 
are  scheduled  on  processors  dynamically.  Potentially,  all  of  the  tasks 
generated  by  a  parallel  construct  could  execute  simultaneously.  Whether  this 
actually  occurs  is  a  function  of  scheduling  policy,  system  load,  and  various 
characteristics  of  the  user  program  and  language  implementation  details. 

3.2.2. Barrier Synchronization While it is important to minimize the
idle  processor  time  caused  by  synchronization  among  parallel  tasks,  such 
synchronization  is  occasionally  necessary.  Barrier  synchronization  causes 
each  of  the  tasks  executing  a  common  routine  to  wait  at  the  "barrier",  or 
synchronization  point,  until  all  of  the  tasks  of  a  given  set  (e.g.,  those 
processes  spawned  on  behalf  of  a  particular  parallel  loop)  have  reached  that 
point.  Algorithms  for  barrier  synchronization  have  been  given  by  Rudolph 
[82]  and  Kruskal  [81b]. 

The  barrier  is  specified  with  the  sync  statement.  This  form  of 
synchronization  is  employed,  for  example,  when  serial  loops  executed  by 
parallel  processes  must  execute  in  lock-step. 

3.3.   Implementation  of  Parallel  Constructs 

The  benefits  of  a  "high-level"  programming  style  were  discussed  above. 
We  now  must  consider  whether  that  style  can  be  supported  without  loss  of 
efficiency or flexibility. In particular, we must ensure that opportunities for
parallelism  in  programs  are  not  sacrificed  due  to  the  parallel  language 
implementation.  We  shall  see  that  the  granularity  of  parallelizable  tasks  is  a 
crucial  barometer  of  the  effectiveness  of  the  implementation. 

The  basic  mechanism  provided  by  the  operating  system  for  creating 
parallelism is the spawn operation. As such, it is the natural facility with
which  to  support  the  parallel  loop  and  parallel  block  constructs.  Spawn  is 
fundamentally  an  n-way  fork  of  control,  in  which  n  identical  subtasks  are 
created.  The  subtasks  are  made  available  for  scheduling  on  any  available 
PEs,  in  a  manner  to  be  discussed  subsequently.  We  assume  here  that  the 
"parent"  process  waits  for  the  termination  of  its  spawned  "children",  which 
occurs automatically at the end of the parallel code block. Hence the parent
process, and thus the program statements that follow, are synchronized with
the completion of the spawned parallel processes.

The  translation  of  a  parallel  block  into  runtime  code,  then,  might  make 
use  of  the  following  operating  system  primitives.  We  suppose  that  a  parallel 
loop,  with  n  homogeneous  iterates,  has  been  coded. 

spawn(n)  -  is  executed  by  the  original  (parent)  process.  It  stores  the  value  n 
in  a  shared  children  variable,  for  use  in  synchronizing  at  the  end  of  the  block, 
then  adds  n  items  to  the  system  work  queue. 

wait  -  is  executed  by  the  parent  following  the  spawn.  The  parent  process 
blocks  until  resumed  subsequently  by  an  active  process. 

terminate - is executed by the spawned child tasks at the end of the loop
body. terminate records the process termination with a F&A(children,-1)
operation; if this drives children to zero then the current process is the last
terminating child and thus resumes the parent process. Otherwise, terminate
calls gettask to obtain new work.

gettask - deletes a process from the central work pool, and initiates its
execution.

An  obvious  implementation  for  the  central  work  pool  is  the  fetch-and- 
add  based  parallel-access  queue  which  was  mentioned  earlier. 
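The interplay of spawn, wait, terminate, and gettask can be sketched as follows. This is a software illustration under several assumptions: worker threads stand in for PEs, the standard library queue stands in for the parallel-access work pool, an event object stands in for resuming the blocked parent, and the lock inside SharedInt only emulates fetch-and-add. The names ParallelBlock, pe_loop, and run_parallel_loop are hypothetical, and n is assumed to be at least 1.

```python
import queue
import threading


class SharedInt:
    """Integer with a lock-emulated fetch-and-add (software stand-in only)."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self.value
            self.value = old + e
            return old


class ParallelBlock:
    def __init__(self, n, body):
        self.children = SharedInt(n)             # the shared children variable
        self.body = body
        self.parent_resumed = threading.Event()


def spawn(pool, n, body):
    """Parent: record n in children, then add n items to the work pool."""
    block = ParallelBlock(n, body)
    for j in range(n):
        pool.put((block, j))
    return block


def wait(block):
    """Parent: block until resumed by the last terminating child."""
    block.parent_resumed.wait()


def terminate(block):
    """Child: F&A(children, -1); an old value of 1 means children is now zero."""
    if block.children.fetch_and_add(-1) == 1:
        block.parent_resumed.set()


def pe_loop(pool):
    """One simulated PE: gettask, run the loop body, terminate, repeat."""
    while True:
        task = pool.get()                        # gettask
        if task is None:                         # shutdown signal for this sketch
            return
        block, j = task
        block.body(j)
        terminate(block)


def run_parallel_loop(n, body, num_pes=4):
    pool = queue.Queue()
    pes = [threading.Thread(target=pe_loop, args=(pool,)) for _ in range(num_pes)]
    for t in pes:
        t.start()
    wait(spawn(pool, n, body))                   # parent spawns, then waits
    for _ in pes:
        pool.put(None)
    for t in pes:
        t.join()
```

Here a PE fetches its next task in its main loop rather than from inside terminate, which is equivalent for the purposes of this sketch. Note that spawn performs n separate inserts.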

3.4.  Memory  Management 

A  major  issue  in  the  implementation  of  block  structured  parallel 
languages  is  that  of  stack  management.  Since  the  iterates  or  statements  of  a 
parallel loop or block are executed concurrently, a single run-time stack is
insufficient.  Instead,  a  "cactus  stack"  organization  (Hauck  and  Dent  [68])  is 
appropriate.  A  different  stack  frame  is  allocated  to  hold  local  data  and 
control  information  for  each  spawned  task,  and  these  frames  are  all  linked  to 
the  parent's  stack  frame.  There  is  no  serial  overhead  in  creating  or 
destroying  these  private  frames  since,  as  we  shall  see,  concurrent  memory 
requests  for  stack  space  can  be  processed  in  parallel. 
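A cactus stack can be pictured as a tree of frames in which every spawned task owns a private frame linked to the frame of its parent. The sketch below illustrates only this data structure; the Frame class, its lookup method, and the variable named j are hypothetical, and frame allocation here is ordinary serial Python rather than the parallel memory allocation the text refers to.

```python
class Frame:
    """One stack frame: private local data plus a link to the parent frame."""

    def __init__(self, parent=None):
        self.parent = parent
        self.locals = {}

    def lookup(self, name):
        """Resolve a name in this frame, then in enclosing (shared) frames."""
        frame = self
        while frame is not None:
            if name in frame.locals:
                return frame.locals[name]
            frame = frame.parent
        raise NameError(name)


def spawn_frames(parent, n):
    """Allocate one private frame per spawned task, all linked to parent."""
    frames = []
    for j in range(n):
        frame = Frame(parent)
        frame.locals["j"] = j        # each iteration's private loop index
        frames.append(frame)
    return frames
```

Variables stored in a child frame are private to that task, while names resolved through the parent link are shared by all siblings, which matches the static scope rule for shared and private data described in section 3.1.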

3.5.  Performance  Issues  in  Parallel  Control 

There  are  two  significant  performance  criteria  by  which  any 
implementation  of  these  operating  system  primitives  must  be  evaluated. 
First,  we  wish  to  avoid  algorithms  that  require  time  linear  (or  worse)  with 
respect  to  the  number  of  spawned  tasks.  Effective  exploitation  of  high 
degrees of parallelism will quickly vanish if the scheduling of parallel tasks
requires  a  serial  operation  that  is  a  linear  function  of  the  degree  of 
parallelism.  Hence  it  is  unacceptable  to  implement  a  spawn  of  n  processes  by 
a  sequence  of  n  insert  operations  on  a  system  task  queue.  Instead,  we  would 
like  to  have  a  spawn  operation  insert  a  single  item  with  "weight"  n  on  the 
task queue, in constant time. Such an implementation is discussed in section
4.1. A constant-time spawn, a fully parallel (critical-section-free) gettask,
and  fully  parallel  memory  allocation  routines  will  avoid  the  phenomenon  of 
diminishing  marginal  benefits  when  the  degree  of  parallelism  is  increased. 

However, the overhead (in absolute terms) of these operations is also
crucial, because it determines the granularity of parallel tasks for which the
scheduling mechanisms presented so far in this section are useful. If we
estimate the total number of instructions executed by spawn, gettask, and
terminate, including not only the insert and delete operations on the system
task queue but also the process instantiation functions of memory allocation
(for private variables and stack frame), loading of memory-management
registers, etc., then it becomes clear that dynamically establishing the parallel
context for a forall or parbegin block is of nontrivial expense. Although the
basic unit of parallelism provided by the language constructs is the program
statement, it is clear that spawning tasks that each execute a trivial statement
(e.g., one assignment or arithmetic operation) would result in great
inefficiencies. While measures can be taken to reduce this overhead, for
example by careful design of memory allocation and management
subsystems, the facilities described to this point must be considered useful
only for large-granularity parallel tasks. This limitation can be alleviated in
several ways.

Parallel loop iterates of smaller granularity can often be supported with a
policy known as chunking. When the number of parallel iterates (n) is much
larger than the number of available PEs (N), or the number of instructions in
the body of the loop is small compared to the overhead of task creation, then
the loop body can be transformed into a nested loop consisting of k serial
iterations, where k < n/N, and the size of the spawn changed to n/k. Thus
the scheduling overhead per iterate is reduced by a factor of k invisibly to the
programmer, without diminishing the effective parallelism. The value of k is
most appropriately determined at run time as a function of the number of PEs
available and possibly of system load. Care must be taken to avoid setting k
too high and nullifying the load-balancing properties of the scheduling
mechanism (see section 4.1). Kruskal and Weiss [84] have examined the
performance of such chunking policies and argue that even very naive
schemes can perform acceptably well. A corresponding policy for
inhomogeneous parallel blocks is to detect at compile time cases where the
granularity of parallelism is too small for efficient execution, and simply
replace the short parbegin blocks by serial blocks.
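
A minimal sketch of the chunking transformation follows. The rule used to pick k (several chunks per PE, controlled by the hypothetical parameter `tasks_per_pe`) is only one naive choice assumed here for illustration; the text does not prescribe a particular rule.

```python
def choose_chunk(n, num_pes, tasks_per_pe=4):
    """Pick k so that roughly tasks_per_pe chunks exist per PE (naive rule)."""
    return max(1, n // (num_pes * tasks_per_pe))

def chunk_ranges(n, k):
    """Split iterates 0..n-1 into about n/k chunks of at most k iterates."""
    return [(start, min(start + k, n)) for start in range(0, n, k)]

def run_chunk(body, start, stop):
    """What one spawned task executes: k serial iterations of the loop body."""
    for i in range(start, stop):
        body(i)
```

With n = 10000 and N = 100 this rule gives k = 25, so 400 tasks are spawned instead of 10000 while every PE still has several chunks to draw on for load balancing.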

Effective  granularity  of  programmed  parallel  tasks  can  be  further 
reduced  by  pre-spawning  processes.  Here  either  the  programmer  or  compiler 
causes  a  sufficient  number  of  tasks  to  be  spawned  in  advance  of  any  parallel 
constructs  and  immediately  suspended.  The  forall  or  parbegin  code  then 
activates  these  tasks  by  sending  a  message  or  a  signal,  thus  "creating" 
parallel  threads  of  control  without  the  overhead  of  operating  system  calls. 
Pre-spawning  is  less  effective  when  the  degree  of  parallelism  is  volatile. 
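
The idea can be sketched with ordinary threads standing in for pre-spawned tasks. The class `PrespawnedPool` and its methods are hypothetical names for this illustration, and Python's `queue.Queue` plays the role of the message that activates a suspended task.

```python
import queue
import threading

class PrespawnedPool:
    """Tasks created once, in advance, and suspended until work arrives."""
    def __init__(self, num_tasks):
        self._mailbox = queue.Queue()
        self._tasks = [threading.Thread(target=self._serve, daemon=True)
                       for _ in range(num_tasks)]
        for t in self._tasks:
            t.start()

    def _serve(self):
        while True:
            message = self._mailbox.get()   # suspended until activated
            if message is None:             # shutdown request
                return
            work, index = message
            work(index)

    def forall(self, n, body):
        """Run body(0..n-1) on the pre-spawned tasks and collect the results."""
        results = [None] * n
        pending = [n]
        lock = threading.Lock()
        finished = threading.Event()

        def iterate(i):
            results[i] = body(i)
            with lock:
                pending[0] -= 1
                if pending[0] == 0:
                    finished.set()

        for i in range(n):
            self._mailbox.put((iterate, i))  # "creates" a thread of control
        if n > 0:
            finished.wait()
        return results

    def shutdown(self):
        for _ in self._tasks:
            self._mailbox.put(None)
        for t in self._tasks:
            t.join()
```

No task is created or destroyed inside `forall`; the cost of each parallel construct is reduced to sending messages and counting completions.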

Finally,  we  can  pre-spawn  non-preemptable  tasks  that  busy-wait  until 
needed  for  a  parallel  segment  of  code.  Far  finer  granularities  of  parallel 
tasks  can  thus  be  programmed  with  high-level  constructs,  since  virtually  all  of 
the  overhead  of  process  creation,  scheduling,  and  context  switching  is 
eliminated.  It  remains  to  be  seen  whether  such  a  policy  will  constitute  an 
avenue toward optimally efficient execution.

3.6.   Performance  Issues  in  Barrier  Synchronization 

The  sync  operation  may  be  implemented  in  either  of  two  ways, 
analogously  to  the  situation  of  process  spawning.  It  is  always  correct  for  the 
processor  to  suspend  the  current  task  while  waiting  for  the  synchronization  to 
complete,  and  switch  to  another  task.  However,  considerable  operating 
system  overhead  is  incurred.  The  alternative  is  a  busy-waiting 
synchronization  routine  which  simply  loops  testing  a  counter  until  the 
synchronization condition is satisfied. The overhead involved is significantly
less for busy-waiting than for task switching, provided the waiting period is
short.  However,  busy-waiting  admits  the  possibility  of  deadlock  when  the 
number  of  synchronizing  processes  is  greater  than  the  number  of  available 
processors  (and  no  preemption  is  permitted).  Even  when  deadlock  cannot 
occur  it  is  possible  that  a  number  of  processors  will  loop  unproductively  for 
extended  periods  of  time,  especially  in  a  multiprogramming  environment. 

A  hybrid  sync  implementation,  in  which  each  process  busy-waits  for  a 
short  period  and  then,  if  necessary,  yields  the  PE,  might  be  used  to 
advantage.  Rescheduling  overhead  would  be  avoided  in  those  cases  where 
busy-waiting  is  suitable,  and  deadlock  would  be  prevented.  Synchronization 
has implications for process and job scheduling that are addressed in the
following  section. 
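
A hybrid sync of this kind might look like the sketch below. The class name, the `spin_limit` parameter, and the use of a Python lock to imitate an atomic fetch-and-add on the arrival counter are all assumptions of this illustration; the barrier is single-use for brevity.

```python
import threading

class HybridBarrier:
    """Single-use barrier: busy-wait briefly, then yield the processor."""
    def __init__(self, parties, spin_limit=1000):
        self.parties = parties
        self.spin_limit = spin_limit
        self._arrived = 0
        self._count_lock = threading.Lock()   # simulates atomic fetch-and-add
        self._released = threading.Event()

    def sync(self):
        with self._count_lock:
            self._arrived += 1
            is_last = self._arrived == self.parties
        if is_last:
            self._released.set()              # synchronization condition satisfied
            return "last"
        for _ in range(self.spin_limit):      # short busy-waiting phase
            if self._released.is_set():
                return "spun"
        self._released.wait()                 # give up the PE instead of looping
        return "blocked"
```

Because a waiting process eventually blocks rather than spinning forever, the barrier completes even when there are more synchronizing processes than processors.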

4. Process Scheduling

4.1.   The  "Self-Service"  Paradigm 

Operating systems for uniprocessors and some multiprocessors often
contain  an  active  component  which  schedules  use  of  the  processor(s)  by 
assigning  tasks  as  appropriate.  While  this  centralized  approach  assures 
favorable  load  balancing,  it  introduces  a  serial  bottleneck  which  will  limit 
overall  performance  on  highly  parallel  machines.  We  may  alleviate  this  by 
designating  two  or  more  schedulers,  each  managing  a  portion  of  the 
processors.  This  distributed  approach  leads  to  an  interesting  tradeoff:  if  the 
number  of  schedulers  is  small,  scheduling  bottlenecks  arise;  if  the  number  is 
large,  effective  load  balancing  suffers. 

Stochastic  distributed  scheduling  offers  one  solution.  Here  a  work  queue 
is  maintained  for  each  PE.  Whenever  a  task  is  created  (e.g.,  via  spawn)  on 
any  PE,  that  PE  assigns  the  task(s)  on  a  random  basis  to  one  of  the  work 
queues.  In  concert  with  time-slicing,  or  when  certain  constraints  (on  the 
variance  of  task  lengths)  hold,  this  mechanism  can  effectively  balance  the 
load,  possibly  at  a  cost  in  scheduling  overhead  (see  Klappholz  [82]).  It  does 
not  permit  the  constant-time  spawning  of  multiple  processes  discussed  earlier 
since  the  assignment  of  each  task  must  be  randomized  individually. 
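
The cost structure can be seen in a short sketch. The function name and the representation of per-PE work queues as Python lists are assumptions made for illustration.

```python
import random

def stochastic_spawn(tasks, work_queues, rng=random):
    """Assign each task to a randomly chosen per-PE work queue.
    Returns the number of random draws, which equals the number of tasks."""
    draws = 0
    for task in tasks:
        rng.choice(work_queues).append(task)   # one draw and one insert per task
        draws += 1
    return draws
```

A spawn of n tasks performs n separate draws and inserts, which is exactly the linear cost the constant-time spawn is meant to avoid.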

The  scheduling  paradigm  adopted  for  process  scheduling  (and  other 
resource  management  functions)  in  the  Ultracomputer  is  that  of  a  self-service 
system.  We  maintain  a  single  central  queue  of  ready  tasks.  Each  processor 
accesses  this  shared  queue  to  obtain  tasks  for  execution  and  to  insert  newly- 
spawned  tasks.  This  self-service  paradigm  relies  on  simultaneous  distributed 
processing  of  centralized  data,  and  is  highly  dependent  on  concurrently- 
accessible  data  structures  that  allow  operations  to  be  performed  without 
locking or other serializing techniques. The fetch-and-add based mechanisms
described  earlier  are  crucial  in  implementing  these  critical-section-free 
operations,  e.g.,  queue  insertion  and  deletion. 

The queue algorithms used are enhanced variants of the algorithms
mentioned earlier (a circular array with insert and delete pointers on which
fetch-and-add is used to obtain unique entries). In addition to providing
greater flexibility in storage management for queue entries and greater
protection against certain boundary conditions, these variants support three
important additional features:

(1)  Multiplicity.  A  spawn  of  k  processes  is  implemented  by  the  insertion  of  a 
task  template  with  multiplicity  k.  The  queued  item  will  be  "deleted"  k 
times  before  actually  being  removed  from  the  queue. 

(2)  Priorities.  Tasks  may  be  queued  on  the  central  ready  queue  at  any  one 
of  a  (possibly  fixed)  number  of  priorities.  Delete  operations  select  the 
available item with highest priority.*

(3)  Interior  removal.  In  order  to  swap  out  a  task  from  memory,  or  to 
prematurely  terminate  a  task,  it  is  occasionally  necessary  to  "delete"  an 
item  from  the  middle  of  a  queue  (Wilson  [85]). 
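
The multiplicity feature, item (1) above, can be illustrated with the following sketch. `FetchAndAddCell` is a software stand-in for the hardware primitive, and `TaskTemplate` and `ReadyQueue` are hypothetical names; the structural removal of an exhausted entry leans on Python's deque and is not meant to reproduce the actual queue algorithm.

```python
import threading
from collections import deque

class FetchAndAddCell:
    """Software stand-in for a memory cell with hardware fetch-and-add."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # simulates atomicity only

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

class TaskTemplate:
    def __init__(self, body, multiplicity):
        self.body = body
        self.multiplicity = multiplicity
        self.remaining = FetchAndAddCell(multiplicity)

class ReadyQueue:
    def __init__(self):
        self._entries = deque()

    def __len__(self):
        return len(self._entries)

    def spawn(self, body, k):
        """One insertion, whatever the value of k."""
        self._entries.append(TaskTemplate(body, k))

    def gettask(self):
        """Return (body, child_index), or None if no work is available."""
        for entry in list(self._entries):
            old = entry.remaining.fetch_and_add(-1)
            if old > 0:
                if old == 1:                    # this was the k-th "delete"
                    self._entries.remove(entry)
                return entry.body, entry.multiplicity - old
            entry.remaining.fetch_and_add(1)    # already exhausted; undo
        return None
```

A spawn of 1000 children leaves a single entry on the queue; each gettask decrements its counter and obtains a distinct child index, and only the last of them removes the entry.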

Distributed scheduling from a central ready queue achieves optimal load
balancing among the PEs and also facilitates multiprogramming. Each of
several unrelated jobs may contribute tasks to the ready queue; all will be
scheduled as PEs become available. When programs with volatile parallelism
are run, multiprogramming can improve throughput by allowing serial
sections and highly parallel sections of different jobs to be overlapped. If
desired, time-slicing can be used to improve the average turnaround time of
multiprogrammed parallel jobs.

4.2.   Job  Scheduling 

This  simple  model  of  a  central  pool  of  individual  tasks  is  not  adequate  to 
effectively  schedule  all  types  of  parallel  applications.  Many  parallel  program 
designers  will  need  the  capability  to  manage  a  set  of  PEs  at  a  lower  level  than 
that  which  results  from  use  of  implicit  parallel  control  constructs.  Such 
programs  will  be  generally  static  in  their  degree  of  parallelism,  and  will  have 
little  need  of  operating  system  scheduling  facilities  after  an  initial  phase.  In 
fact,  since  such  a  program  can  implement  its  own  internal  scheduling, 
synchronization, and memory management, it will be desirable to eliminate
the process instantiation and context switching overhead associated with
operating system scheduling services.

* Among other more obvious functions, task priorities are needed in management of
nested spawns. The dynamic process structure of a program in which spawns are nested
several levels deep may be depicted as a tree, in which spawned child processes are
represented by nodes that are descendants of their parent process node. If the program is
executed such that the process tree is traversed breadth-first, then there is a danger that
because the processes at each level are spawning more processes before any process may
terminate, the capacity of memory or system tables may be overwhelmed by an exponential
explosion in instantiated processes. This is avoided by ensuring depth-first traversal of the
process tree, which may be achieved by enqueuing the template for creation of spawned
children on the ready queue at a priority greater than that of the parent.

We  propose  to  add  a  variant  of  the  spawn  system  primitive  to  support 
non-preemptable, related groups of parallel tasks. When granting a
non-preemptable spawn request, the operating system first ensures that the total
number  of  non-preemptable  tasks  will  not  exceed  the  number  of  processors, 
then  arranges  that  the  spawned  tasks  be  assigned  PEs.  Thus  long-running 
tasks  stemming  from  a  particular  spawn  may  be  immune  from  time-slicing 
and  other  forms  of  preemption.  In  order  to  see  why  this  is  needed  we 
consider  a  spectrum  of  "organizational  styles"  of  parallel  programs. 

(1)  On  the  extreme  (static)  end  of  the  scale  are  jobs  that  are  static  in  their 
use  of  PEs  and  are  subject  to  real-time  constraints.  Tasks  associated 
with such jobs would not be preempted under any circumstances.

(2)  More  important  is  the  class  of  non-real-time  applications  referred  to 
above,  which  are  designed  to  coordinate  parallel  activities  on  a  fixed 
group  of  PEs.  Many  such  programs  may  be  characterized  by  very 
frequent  internal  synchronization.  Busy-waiting  synchronization  is 
appropriate  because  the  overhead  of  even  occasional  process  blocking 
and  rescheduling  might  be  prohibitive;  furthermore,  if  all  of  the 
constituent  processes  are  guaranteed  to  be  executing  concurrently  then 
busy-waiting  is  efficient.  Such  a  job  may  be  swapped  out,  if  desired 
because  of  turnaround  considerations,  as  long  as  all  of  the  processes  are 
swapped  in  and  out  together. 

(3)  Jobs  displaying  dynamic  parallelism  are  to  some  extent  better  suited  to 
our  original  model  of  task  scheduling.  Although  internal 
synchronization  will  still  be  needed  occasionally,  it  can  be  adequately 
managed  with  the  hybrid  synchronization  mechanism  proposed  above, 
even  in  the  presence  of  chunking.  However  there  remain  several  reasons 
to  arrange  scheduling  mechanisms  that  tend  to  keep  related  ("sibling") 
processes  executing  concurrently.  First,  there  may  exist  algorithms 
whose  performance  is  improved  when  parallel  processes  execute  more  or 
less  evenly,  that  is,  when  scheduling  policies  tend  to  maximize  the 
execution  rate  of  the  slowest-progressing  task  in  the  spawned  set. 
Second,  the  effectiveness  of  the  processor  cache  will  be  improved  when 
successive  tasks  executed  on  the  same  PE  come  from  the  same  job,  since 
they are then likely to reuse cache entries for program code.*

The structure of the central ready queue is further complicated by the
need  to  recognize  groups  of  sibling  tasks  in  swapping  and  scheduling. 
However,  the  original  fetch-and-add  based  notion  of  bottleneck-free  inserts 
and  deletes  is  maintained. 


* Here we assume that the cache architecture permits retaining of cache lines across a
context switch.

5. Building the Parallel Operating System

Since  programs  written  for  an  Ultracomputer-like  machine  are 
anticipated to generate hundreds or thousands of concurrent activities, we
must  be  prepared  to  accommodate  a  correspondingly  high  level  of 
simultaneous  requests  for  operating  system  services.  Serial  processing  of 
these  requests  will  generate  unacceptable  bottlenecks  on  a  large  machine. 
Therefore,  the  operating  system  must  itself  be  a  parallel  program, 
decentralized  and  free  of  serial  code  sections  that  would  cause  bottlenecks. 

We  have  already  outlined  the  implementation  of  processor  scheduling  for 
these  constraints.  In  most  respects  the  parallel  operating  system  faces  the 
same  coordination  problems  as  any  parallel  program  that  is  "low  level"  in  the 
sense  defined  earlier.  The  synchronization  primitives  and  data  structures 
described  below  in  the  context  of  the  operating  system  are  equally  applicable, 
with  minor  implementation  differences,  in  user  applications. 

The  algorithms  and  mechanisms  described  have  been  implemented  in  an 
experimental  operating  system  which  is  running  on  a  prototype  (8-processor) 
Ultracomputer.  Based  on  UNIX  Version  7,  the  experimental  system  is 
symmetric  (i.e.,  there  is  no  master-slave  relationship)  as  well  as  parallel.  It 
incorporates  rudimentary  but  usable  facilities  for  parallel  applications 
programs. 

5.1.  Data  Structures 

All  operating  system  functions  rely  heavily  on  the  use  of  shared  data 
structures  that  can  be  efficiently  accessed  in  parallel.  In  this  section  we 
consider  a  few  particular  structures  that  have  proven  useful.  Once  again,  the 
fact  that  multiple  simultaneous  references  to  the  same  memory  location  can 
be  accomplished  in  the  time  required  for  one  reference  allows  us  to  avoid 
serial  bottlenecks. 

Note  that  the  linked  list,  a  popular  data  structure  of  wide  application  in 
(serial)  resource  and  data  management  code,  is  not  suitable  where  concurrent 
manipulation  is  required.  We  know  of  no  algorithm  for  deleting  items  in  a 
linked  list  without  locking  out  some  other  accesses,  or  for  assigning 
individual  list  items  to  different  PEs  without  serialization.  The  desirable 
characteristics  of  linked  lists  must  be  found  in  other  structures. 

5.1.1. Queues  We have seen the central role played by the fetch-and-add
based parallel queue in process scheduling. Similar queues are used by
synchronization primitives: the set of processes waiting for a lock to be
released, or for an event, are held on a queue (see section 5.3.2). Parallel
queues  are  useful  in  managing  a  parallel-accessed  pool  of  items  even  where 
there  is  no  ordering  required. 

Parallel  queues  used  in  the  prototype  operating  system  are  of  somewhat 
different  structure  from  the  simple  fixed-size  circular  array  discussed  earlier. 
Queues of unbounded size are implemented by associating a linked list,
protected  by  a  semaphore,  with  each  element  of  the  fixed-size  array.  An 
insertion  at  array  element  j  appends  its  item  to  the  list  associated  with 
element  j.  The  maximum  concurrent  access  supported  by  this  structure 
equals  the  number  of  lists,  i.e.,  the  size  of  the  fixed  array  (Rudolph  [82]). 
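
A sketch of this arrangement is given below, under several assumptions: `FetchAndAddCell` simulates the hardware primitive in software, Python deques stand in for the linked lists, and the counter-based underflow test is a simplification chosen for the illustration rather than a description of the prototype's code.

```python
import threading
import time
from collections import deque

class FetchAndAddCell:
    """Software stand-in for a memory cell with hardware fetch-and-add."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # simulates atomicity only

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

    def load(self):
        return self.fetch_and_add(0)

class UnboundedParallelQueue:
    def __init__(self, width):
        self.width = width                       # size of the fixed array
        self.lists = [deque() for _ in range(width)]
        self.sems = [threading.Semaphore(1) for _ in range(width)]
        self.insert_ptr = FetchAndAddCell()
        self.delete_ptr = FetchAndAddCell()
        self.count = FetchAndAddCell()           # completed inserts not yet claimed

    def insert(self, item):
        j = self.insert_ptr.fetch_and_add(1) % self.width
        with self.sems[j]:                       # serializes only list j
            self.lists[j].append(item)
        self.count.fetch_and_add(1)

    def delete(self):
        if self.count.load() <= 0:
            return None
        if self.count.fetch_and_add(-1) <= 0:
            self.count.fetch_and_add(1)          # nothing to claim; undo
            return None
        j = self.delete_ptr.fetch_and_add(1) % self.width
        while True:
            with self.sems[j]:
                if self.lists[j]:
                    return self.lists[j].popleft()
            time.sleep(0)                        # matching insert still in flight
```

Operations directed at different array elements never contend; only those that land on the same list are serialized by its semaphore, which is why the achievable concurrency equals the array size.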

5.1.2.  Hash  tables  A  similar  structure  results  from  the  use  of  hash 
tables  to  access  indexed  (dictionary)  information.  The  size  of  the  hash  table 
(number of buckets) is set according to the desired maximum concurrency,
usually the number of PEs times a small factor. Buckets are linked lists,
protected by readers-writers locks, so that in a high percentage of cases items
are accessed with no serialization. See section 5.3.1 for an example of this
structure. 

5.2.  Memory  Allocation 

The  creation  of  a  new  task  requires  allocation  of  space  for  its  private 
variables as well as its cactus stack frame. As with task management, we adopt
a  self-service  mechanism.  The  processor  that  retrieves  a  task  from  the  ready 
queue  executes  the  memory  allocation  code.  A  number  of  parallel  algorithms 
for  memory  management  have  been  designed,  including  (non-demand) 
paging,  two  variants  of  the  Buddy  System  (Knuth  [68]),  and  a  boundary  tag 
method  (Knuth  [68]).  All  are  parallel  analogs  of  serial  algorithms.  All, 
except  for  one  of  the  Buddy  System  variants,  maintain  queues  (whereas  their 
serial  analogs  keep  lists)  of  free  memory  blocks.  Insertions,  deletions,  and 
accesses  to  blocks  within  a  concurrently-accessible  queue  are  at  the  core  of 
these  algorithms;  as  a  consequence  we  obtain  critical-section-free  memory 
allocation. 

5.3.  Coordination  Primitives 

In  order  to  permit  tasks  to  cooperate,  it  is  often  necessary  to  coordinate 
their  accesses  to  shared  data  structures.  An  ideal  situation  for  parallel 
execution  occurs  when  completely  asynchronous  behavior  is  permitted,  as  in 
some  "chaotic"  algorithms.  When  restricting  access  patterns,  one  must  be 
careful  to  permit  as  much  parallelism  as  possible.  In  this  section  we  discuss 
three  mechanisms  for  task  coordination  that  have  been  successfully  used  in 
our  prototype  multiprocessor  system,  namely  counting  semaphores, 
readers/writers  locks,  and  events.  When  designing  such  access  mechanisms, 
one  must  specify  whether  a  processor  denied  access  should  (busy)  wait,  or 
suspend  execution  of  the  current  task  and  switch  to  another. 

5.3.1. Busy-Waiting Synchronization  Despite the potential for waste
in busy-waiting, there are several reasons for using it, including potentially
low overhead and applicability to situations where context switching is
inappropriate.* There are many possible busy-waiting mechanisms; we have
used counting semaphores and readers/writers synchronization extensively,
and have described algorithms for them before (Gottlieb et al. [83]).

Semaphores  are  often  used  to  implement  simple  locks  to  serialize  access 
to  a  small  partition  of  a  larger  concurrently  accessible  data  structure;  for 
example,  the  linked  lists  used  in  the  queues  of  unrestricted  size  of  section 
5.1.1  are  protected  with  semaphores. 

Readers/writers  synchronization  is  used  naturally  in  cases  where 
exclusive  access  is  required  only  infrequently.  Our  implementation  also 
supports  upgrading  a  read  lock  to  a  write  lock  and  downgrading  a  write  lock 
to  a  read  lock.  This  has  been  used  to  implement  search  structures  which 
must  support  the  operation:  "search  for  an  item,  and  insert  it  if  not  found". 
Such  a  structure  uses  linked  lists  accessed  through  a  hash  table,  as  described 
in  section  5.1.2.  By  performing  the  search  with  a  read  lock,  serialization  is 
avoided  in  many  cases  (the  most  important  case  is  where  many  processors 
search  for  the  very  same  item).  Only  if  the  item  is  not  found  is  it  necessary 
to  upgrade  the  lock.  The  upgrade  operation  can  succeed  or  fail;  the 
processors  that  fail  go  back  to  search  again,  while  the  (one)  processor  that 
succeeds  performs  the  insert.  These  search  structures  have  been  widely 
applicable  in  our  prototype  operating  system,  e.g.  in  managing  file  system 
information. 
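
The search-and-insert protocol can be sketched as follows. The lock below is built on a Python condition variable purely for illustration, so it blocks rather than busy-waits and does not reproduce the fetch-and-add algorithms cited above; the names `UpgradableRWLock`, `SearchTable`, and `find_or_insert` are hypothetical.

```python
import threading

class UpgradableRWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._upgrading = False

    def acquire_read(self):
        with self._cond:
            while self._writer or self._upgrading:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def try_upgrade(self):
        """Caller holds a read lock. True: caller now holds the write lock.
        False: another upgrade won, and the caller's read lock is released."""
        with self._cond:
            self._readers -= 1
            if self._upgrading:
                self._cond.notify_all()
                return False
            self._upgrading = True
            while self._readers > 0:      # wait for remaining readers to leave
                self._cond.wait()
            self._writer = True
            self._upgrading = False
            return True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

class SearchTable:
    def __init__(self, buckets):
        self._locks = [UpgradableRWLock() for _ in range(buckets)]
        self._chains = [[] for _ in range(buckets)]

    def find_or_insert(self, key, make_value):
        b = hash(key) % len(self._chains)
        lock, chain = self._locks[b], self._chains[b]
        while True:
            lock.acquire_read()
            for k, v in chain:            # search under the read lock
                if k == key:
                    lock.release_read()
                    return v
            if lock.try_upgrade():        # exactly one searcher succeeds
                value = make_value()
                chain.append((key, value))
                lock.release_write()
                return value
            # Upgrade failed: go back and search again.
```

When many processors look up the same missing key, one of them performs the insert and the rest find the item on their second search.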

5.3.2. Non-Busy-Waiting Synchronization  Non-busy-waiting
synchronization  is  commonly  used  in  multiprogramming  systems  since  it 
allows  a  processor  to  go  on  to  more  useful  work  when  the  progress  of  the 
current  task  is  logically  blocked.  Rather  than  waiting  for  progress  to  be 
possible  again,  the  processor  places  the  current  task  on  a  queue  associated 
with  the  condition  to  be  satisfied,  and  executes  the  next  task  from  the  ready 
queue.  When  the  condition  is  eventually  satisfied,  the  blocked  task  is  moved 
from  the  waiting  queue  to  the  ready  queue.  Although  this  organization  is 
fundamental  to  all  multiprogramming  systems,  the  exact  forms  of  the 
synchronization  primitives  vary. 

The  best  examples  of  the  use  of  non-busy-waiting  synchronization  come 
from  the  area  of  I/O  processing.  This  takes  on  special  significance  for  the 
Ultracomputer,  because  the  number  of  tasks  performing  related  I/O 
operations  can  be  large.  For  example,  searching  of  important  file  system 
directories  would  be  a  bottleneck  if  serialized.  It  is  very  likely  that  a  group 
of  tasks  reading  such  a  directory  (or  other  frequently-accessed  file)  would  all 
simultaneously  attempt  to  read  the  same  disk  block;  serialization  would  be 
devastating.

* E.g., in the implementation of non-busy-waiting mechanisms.

At a low level, physical I/O devices require serialization; this is easily
provided by semaphores. Apparent parallelism for disks and similar devices
can be achieved through the use of in-memory buffer caching. Once a disk
block  is  copied  into  a  memory  buffer,  there  is  no  obstacle  to  concurrent 
access  for  reading;  readers/writers  synchronization  is  appropriate  in  this 
situation. 

Events  are  often  associated  with  external  occurrences,  such  as  the 
completion  of  an  I/O  operation  or  the  arrival  of  input  from  a  user's  terminal. 
At  such  a  time,  the  event  is  signaled,  remaining  in  this  state  until  reset. 
Additional  signals  (before  the  reset)  have  no  effect. 
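
These event semantics can be captured in a small single-threaded model. The class `IOEvent` and its method names are hypothetical, and the model ignores how the awakened tasks are actually moved to the ready queue.

```python
class IOEvent:
    """Model of an event: once signaled it stays signaled until reset."""
    def __init__(self):
        self.signaled = False
        self.waiting = []

    def wait(self, task):
        """Return True if the task may proceed at once; otherwise queue it."""
        if self.signaled:
            return True
        self.waiting.append(task)
        return False

    def signal(self):
        """Signal the event and return the tasks to be awakened."""
        if self.signaled:
            return []                     # additional signals have no effect
        self.signaled = True
        awakened, self.waiting = self.waiting, []
        return awakened

    def reset(self):
        self.signaled = False
```

A task that waits after the signal but before the reset proceeds immediately, which matches the description of the event remaining in the signaled state.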

There  is  a  special  difficulty  in  designing  efficient  algorithms  for 
readers/writers  and  events,  because  the  most  natural  implementation  would 
require  a  single  task  to  completely  empty  a  queue,  introducing  a  serial 
bottleneck.  When  a  writer  releases  its  lock,  it  must  awaken  all  waiting 
readers,* and when an event is signaled, all tasks waiting for it must be
awakened.  We  have  developed  two  approaches  to  this  problem.  One 
approach  is  to  move  the  wait  queue  as  a  single  object  onto  the  system  ready 
queue,  where  it  will  be  treated  in  much  the  same  way  as  an  item  with 
multiplicity.  Another  approach  is  to  let  each  newly  awakened  task  "help 
out"  by  waking  up  other  tasks  until  the  wait  queue  is  empty.  This  approach 
is less complex than the "queue of queues" method, but requires more time.
A  variation  on  this  method  is  to  spawn  a  set  of  high  priority  tasks  (with 
multiplicity  proportional  to  the  size  of  the  wait  queue)  to  assist  with  the  wake 
up. 
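
The "help out" approach can be explored with a simple round-based simulation. The model assumed here (the signaling task and every awakened task each wake one waiter per round) is an idealization introduced for illustration and says nothing about measured costs on the prototype.

```python
from collections import deque

def cooperative_wakeup(waiters):
    """Simulate wake-up in rounds; return (awakened tasks, number of rounds)."""
    wait_queue = deque(waiters)
    helpers = 1                      # the task that signals the event
    awakened = []
    rounds = 0
    while wait_queue:
        rounds += 1
        batch = min(helpers, len(wait_queue))
        woken_now = [wait_queue.popleft() for _ in range(batch)]
        awakened.extend(woken_now)
        helpers += len(woken_now)    # newly awakened tasks help in later rounds
    return awakened, rounds
```

Under this model the number of helpers doubles each round, so 1000 waiting tasks are awakened in 10 rounds rather than in 1000 serial steps.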

As  mentioned  above,  many  events  are  associated  with  physical  I/O. 
Some  I/O  events  might  never  happen,  e.g.  input  from  a  user's  terminal. 
Thus  it  must  be  possible  to  terminate  a  task  that  is  blocked  on  such  an 
event.** We have developed a method of premature unblocking for all of the
non-busy-waiting synchronization primitives described above. This requires
removing the task from the middle of the wait queue (Wilson [85]). In each
case, care must be taken to ensure that the state of the synchronization
variable  (e.g.  semaphore  value)  remains  consistent. 

6.   Conclusion 

Previous  shared-memory  multiprocessors  have  been  limited  to  a  low 
degree  of  parallelism.  The  construction  of  highly  parallel  machines  requires 
that  no  serial  bottlenecks  are  introduced  by  either  hardware  or  software.  The 
fetch-and-add  synchronization  primitive  provides  a  simple  and  powerful 
means for achieving this goal by permitting programmers to employ shared
data structures without relying on critical sections. We believe that parallel
operating systems can be built that perform all functions of resource
management, scheduling, and coordination using critical-section-free
algorithms. The prototype NYU Ultracomputer operating system has been
designed to demonstrate the feasibility of this approach.

* It is usually desirable for writers to have priority, so the readers are only to be
awakened if there are no more waiting writers.

** Other, more sophisticated, actions are also possible, e.g. UNIX signal handling.

Effective  use  of  a  highly  parallel  computer  is  facilitated  by  a  simple 
programming  model  that  permits  writing  of  programs  using  variants  of 
conventional  high-level  procedural  languages.  Further  experience  is  needed 
to  determine  whether  the  operating  system  primitives  that  have  been 
designed  to  date  will  effectively  support  parallel  constructs  for  such 
languages.  In  particular  it  is  crucial  that  these  high-level  mechanisms  not 
introduce  inefficiencies  that  would  prevent  their  widespread  utility.  A 
variety  of  approaches  to  parallel  control  and  scheduling  is  being  studied  for 
the  support  of  a  wide  range  of  parallel  applications. 